Despite being published nearly 25 years ago, the controversial book The Bell Curve (TBC)1 persists in conversations about IQ. The authors, Richard Herrnstein and Charles Murray, claim that IQ strongly influences many significant life events – like whether someone will enter poverty, drop out of school, get married, or even go to jail. Even more contentiously, they extend the consequences of low IQ to demographic groups defined by ethnicity and nationality. As evidence, the authors build and discuss dozens of probability models using public data.
TBC stands out not as a stuffy academic paper, but as accessible prose intended for the general population. The book is political pop-psychology by design. TBC can be divided into 3 sections: a summary of prior IQ research, a presentation of new probability models derived by the authors, and a discussion of the findings, including recommendations for public policy. Because the book is meant to be accessible, the probability models are described in simple, interpretable terms. In fact, TBC includes extensive discussion of what multivariate analysis means – from explaining what a “variable” is in statistics to what logistic regression attempts to model. The authors even include an entire appendix titled “Statistics for People Who Are Sure They Can’t Learn Statistics”.
The technically precise description of America’s fertility policy is that it subsidizes births among poor women, who are also disproportionately at the low end of the intelligence distribution. We urge generally that these policies, represented by the extensive network of cash and services for low-income women who have babies, be ended.
Reallocate some portion of existing elementary and secondary school federal aid away from programs for the disadvantaged to programs for the gifted… At present, there is an overwhelming tilt toward enriching the education of children from the low end of the cognitive ability distribution. We propose more of a balance across the cognitive ability distribution.
The rules that currently govern immigration provide the other major source of dysgenic pressure. It appears that the mean IQ of immigrants in the 1980s works out to about 95. The low IQ may not be a problem: in the past, immigrants have sometimes shown large increases on such measures. But other evidence indicates that the self-selection process that used to attract the classic American immigrant – brave, hard working, imaginative, self-starting, and often of high IQ – has been changing, and with it the nature of some of the immigrant population.
It should be among the goals of public policy to shift the flow of immigrants away from those admitted under the nepotistic rules (which broadly encourage the reunification of relatives) and toward those admitted under competency rules…
The data used to evaluate the relationship between IQ and each of these binary targets come from the National Longitudinal Survey of Youth 1979 (NLSY79)2, a public dataset made available by the Bureau of Labor Statistics. In this study, 12,686 young Americans (aged 14 – 21 at the survey’s start) of varying backgrounds were interviewed each year from 1979 through 1994. Although the survey continued after this time, the last relevant year for TBC was 1990. These interviews covered a vast number of topics, which were tabulated and, more recently, released online.
Because of the book’s controversy and age, much has been said about TBC. This criticism has primarily come from social psychologists and other experts in intelligence. From my perspective, their comments seem embedded in psychology. E.g., from James Heckman’s “Lessons from the Bell Curve”4:
[The derived zAFQT89 field] is not the same as the g that can be extracted from test scores available in their data set… They do not emphasize how little of the variation in social outcomes is explained by AFQT or g.
AFQT is an achievement test that can be manipulated by educational interventions… A person’s AFQT score is not an immutable characteristic beyond environmental manipulation.
While there is nothing wrong with these responses, I am surprised how few of them review the probability models described in TBC. In fact, I have only found one detailed statistical review of a TBC model.
In summary, this web page describes the first logistic regression published in The Bell Curve, in which POVERTY status–either above or below an official level–was predicted by a model consisting of AFQT, SES, and AGE scores from a decade earlier.
I replicated the numbers published in the book’s Appendix 4. I then looked at how well the model fit: it didn’t, not at all in the published sample (N=3367), not much in another independent one (N=1067). I suspect HM didn’t realize they’d made a Type I error, that their model’s predictive accuracy for cases living below the POVERTY level was 0%…
Extraordinary claims require extraordinary evidence. When a recommendation for public policy can negatively impact a group of people, material evidence is required to support the claim. In TBC, the authors cite their probability models as the reason to accept their conclusions. To assess whether these models are meaningful and reliable, they should be evaluated from the perspective of those who could be harmed by the underlying claims.

This evaluation first requires that the potentially harmed population be well-defined. Revisiting TBC’s claims, the authors say (with my added emphasis):
The technically precise description of America’s fertility policy is that it subsidizes births among poor women, who are also disproportionately at the low end of the intelligence distribution.
By saying “disproportionately”, the authors imply that there is some IQ value below which the negative consequences of their probability models become most severe and worthy of political action. More specifically, if the authors are to objectively identify individuals they think have too low of IQ, they would need to state such an IQ value. And if their probability models are to be taken seriously, then the authors should use them to clarify what “disproportionate” means.
While the prior section may sound trivial to a statistician or data scientist, my point is that fairness recommends a classification analysis. In this section, I go further by claiming that TBC’s probability models should only be evaluated as predictive classifications based on the statistical philosophy best describing TBC’s models.
…[Explanation and prediction] are often conflated, yet the causal versus predictive distinction has a large impact on each step of the statistical modeling process and on its consequences. Although not explicitly stated in the statistics methodology literature, applied statisticians instinctively sense that predicting and explaining are different.
At a high-level, explanatory models:
Whereas predictive models:
How do these criteria play out in TBC? Quotes from the authors reflect their thoughts on explanatory modeling in TBC:
Part II was circumscribed, taking on social behaviors one at a time, focusing on causal roles…
If after looking at a variety of these other things which both theory and common sense say should have some bearing on whether a person ends up in poverty, but one ends up with a large, statistically independent role for I.Q., it seems to me to make a causal statement that I.Q. looks like its a cause of poverty, it is a reasonable thing to do.7
By the end of the chapter [on poverty], we will have drawn a controversial conclusion. How did we get there? What makes us think that we have got our causal ordering right? We will walk through the analyses that lie behind our conclusions…
Even an inherently strong relationship can result in low values of R2 if the data points are bunched in various ways, and relatively noisy relationships can result in high values if the sample includes disproportionate numbers of outlier… We therefore consider the regression coefficients themselves (and their associated p values) to suit our analytic purposes better than R2, and that is why those are the ones we relied on in the text.
Note: The authors do have a point that R2 is not the most meaningful statistic for every model. However, they did not adequately motivate why probability models with rare events, as in the poverty model with an overall frequency of 7%, should ignore diagnostics at the model level.
Below, I comment on the role of predictive modeling in TBC:
The binary targets are not materially causal because any proposed mechanism from IQ, SES, and Age is not direct enough. Focusing again on the poverty model, many explanations are possible as to why an individual would enter poverty. Saying IQ is a cause of poverty would require linking IQ with the complex process that ends with the NLSY79 participant below the poverty line. Rather, such explanations fall in the realm of associations, because considerable chance (i.e., noise) is involved.
As mentioned in the section introducing data used in TBC, most binary targets are defined to be measured nearly a decade in the future (relative to when IQ was measured). Undoubtedly, modeling such targets is a prospective process.
The probability models in TBC should prioritize the individual. In the prior section, I explained why fairness recommends an individual-focused classification analysis. Also, TBC’s authors frequently comment on the importance of acknowledging individuals and of not being judgmental: “We cannot think of a legitimate argument why any encounter between individual whites and blacks need be affected by the knowledge that an aggregate ethnic difference in measured intelligence is genetic instead of environmental.”
Overall, I claim that TBC’s probability models are more predictive than explanatory in nature and so they should be evaluated as such.
Many of the binary targets are uncommon to rare in occurrence. In logistic regression, this is called imbalanced data, and it can make some statistics unreliable. To counter this imbalance, I estimate the Matthews correlation coefficient (MCC), which is the correlation between the binary target and the binary classification derived from a TBC probability model. MCC is robust to imbalanced data because it accounts for all possible matching scenarios in the confusion matrix8. To interpret the magnitude of MCC, I use the following guidelines:
| Size of Correlation | Interpretation |
|---|---|
| 0.90 to 1.00 (−0.90 to −1.00) | Very high positive (negative) correlation |
| 0.70 to 0.90 (−0.70 to −0.90) | High positive (negative) correlation |
| 0.50 to 0.70 (−0.50 to −0.70) | Moderate positive (negative) correlation |
| 0.30 to 0.50 (−0.30 to −0.50) | Low positive (negative) correlation |
| 0.00 to 0.30 (0.00 to −0.30) | Negligible correlation |
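Concretely, MCC is computed from the four cells of the confusion matrix. A minimal R sketch (the function name is my own):

```r
# Matthews correlation coefficient from the four confusion-matrix counts.
# Taking square roots factor-by-factor avoids overflow for large counts.
mcc <- function(tp, tn, fp, fn) {
  den <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  if (den == 0) return(0)  # common convention when a margin is empty
  (tp * tn - fp * fn) / den
}

# An imbalanced example: 90% of cases are positive
mcc(tp = 90, tn = 5, fp = 5, fn = 0)  # ~0.69
```

Unlike accuracy, MCC only nears 1 when both classes are classified well, which is why it is a sensible summary for rare targets like the 7% poverty rate.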
Since reproducing TBC’s models is a goal for this project, I want the exact data used by the authors. With the NLSY79 being a public dataset, this is possible. However, the NLSY79 uses a special variable coding system that is not noted in TBC. To the authors’ credit, they did release their modified datasets online and included a data dictionary10. In order to reproduce TBC’s results as closely as possible, I opted to use the authors’ data (i.e., “Nation.txt”).
Another consideration is how the training and holdout datasets are defined. The authors’ data includes a variable called “Sample”. As described in the NLSY79 User’s Guide, this Sample variable has the following 3 values:
- A cross-sectional sample of 6,111 respondents designed to be representative of the noninstitutionalized civilian segment of young people living in the United States in 1979 and born between January 1, 1957, and December 31, 1964 (ages 14–21 as of December 31, 1978)
- A supplemental sample of 5,295 respondents designed to oversample civilian Hispanic, black, and economically disadvantaged non-black/non-Hispanic youth living in the United States during 1979 and born between January 1, 1957, and December 31, 1964
- A sample of 1,280 respondents designed to represent the population born between January 1, 1957, and December 31, 1961 (ages 17–21 as of December 31, 1978), and who were enlisted in one of the four branches of the military as of September 30, 1978
For a given TBC model to reproduce, the training data is based on the cross-sectional partition (which is exactly what TBC’s authors do). The holdout data is based on the supplemental partition. This selection for a holdout dataset is motivated from Dr. Krenz’s classification analysis, where she used the supplemental sample as a holdout.
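As a toy illustration of this partitioning (the Sample value labels 'XSection' and 'Supplement' follow the authors' data dictionary; the rows here are invented stand-ins for Nation.txt):

```r
# Toy frame standing in for the authors' Nation.txt
nation <- data.frame(
  Sample = c('XSection', 'XSection', 'Supplement', 'XSection'),
  Pov89  = c(0, 1, 1, 0)
)
training <- subset(nation, Sample == 'XSection')    # cross-sectional partition
holdout  <- subset(nation, Sample == 'Supplement')  # supplemental partition
nrow(training)  # 3
```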
Also needed for reproduction are the exact data and model specifications. The authors listed parameter and diagnostic output for each model in Appendix 4 of TBC. I used these to experiment with the “Nation.txt” dataset in order to reproduce TBC’s models as closely as possible. 23 of the 24 reproduced models agreed with the book’s listed parameter values to 4 decimal places or more. The remaining model reproduced the parameter values to a similar precision, but its categorical variable took different parameter values. I suspect these differences come from the authors’ use of an older version of Stata versus my use of R: the two implementations may handle categorical levels differently in the design matrix, causing different parameter values.
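The notion of 4-decimal agreement can be made explicit with a small helper (a sketch; the function name is my own, and the published values are the poverty-model coefficients from p. 620 of TBC):

```r
# Check whether reproduced coefficients match published values to a given
# number of decimal places (sketch; helper name is my own)
agrees.to <- function(fitted, published, digits = 4) {
  all(round(fitted, digits) == round(published, digits))
}

# Poverty-model coefficients as published in TBC's Appendix 4 (p. 620)
published <- c(-2.64877, -0.83767, -0.33008, -0.02384)
agrees.to(published + 1e-6, published)  # TRUE: differences below 4 decimals
agrees.to(published + 1e-3, published)  # FALSE
```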
Below I list the shorthand names I give for each TBC model with their corresponding binary target:
setwd("B:/DATA 512/Final Project")
source('200 Code/000_Misc.R')
source('200 Code/400_Model_Definitions.R')
library(knitr)
kable(tbc.models[, .(label, target.description)])
| label | target.description |
|---|---|
| poverty | Under the official poverty line in 1989 |
| poverty.hs | Under the official poverty line in 1989 |
| dropout | Permanently dropped out of high school |
| dropout_interaction | Permanently dropped out of high school |
| get_ged | Received a GED instead of a high school diploma |
| get_bachelors | Received a bachelor’s degree |
| oolf4wks | Out of the labor force for four weeks or more in 1989 |
| oolf4wks.hs | Out of the labor force for four weeks or more in 1989 |
| oolf4wks.col | Out of the labor force for four weeks or more in 1989 |
| unemployed4wks | Unemployed for four weeks or more in 1989 |
| unemployed4wks.hs | Unemployed for four weeks or more in 1989 |
| unemployed4wks.col | Unemployed for four weeks or more in 1989 |
| ever_married30 | Ever married before the age of 30 |
| ever_married30.hs | Ever married before the age of 30 |
| ever_married30.col | Ever married before the age of 30 |
| divorced_in5yrs | Divorced within the first 5 years of marriage |
| divorced_in5yrs.hs | Divorced within the first 5 years of marriage |
| divorced_in5yrs.col | Divorced within the first 5 years of marriage |
| divorced_in5yrs_parents | Divorced within the first 5 years of marriage |
| surveyed_in_jail | The subject was interviewed in jail at least once from 1979 to 1990 |
| surveyed_in_jail.hs | The subject was interviewed in jail at least once from 1979 to 1990 |
| middle_class_values | Did the subject score ‘yes’ on the Middle Class Values Index |
| middle_class_values.hs | Did the subject score ‘yes’ on the Middle Class Values Index |
| middle_class_values.col | Did the subject score ‘yes’ on the Middle Class Values Index |
Next I list the training data specifications used for this project (where the holdout data has the same definition but with Sample == ‘Supplement’):
## Note: The following variables are not present in Nation.txt. I created these to facilitate reproducing TBC's models. See TBC_Bootstrap for details.
# NOTRUN
# GEDvHSGr_Ind = ifelse(GEDvHSGr == 'GED', 1, 0)
# LTHSvHS_Ind = ifelse(LTHSvHS == 'LTHS', 1, 0)
# WedBy30_Ind = ifelse(WedBy30 == '1 Yes', 1, 0)
# NOTRUN
kable(tbc.models[, .(label, filter.training)])
| label | filter.training |
|---|---|
| poverty | Race4 == ‘White’ & Sample == ‘XSection’ & EmpSchl == 1 |
| poverty.hs | Race4 == ‘White’ & Sample == ‘XSection’ & EmpSchl == 1 & EdSample == ‘HS’ |
| dropout | Race4 == ‘White’ & Sample == ‘XSection’ & LTHSvGED != ‘GED’ & !is.na(AllvBA) |
| dropout_interaction | Race4 == ‘White’ & Sample == ‘XSection’ & LTHSvGED != ‘GED’ & !is.na(AllvBA) |
| get_ged | Race4 == ‘White’ & Sample == ‘XSection’ & LTHSvGED != ‘LTHS’ & !is.na(AllvBA) |
| get_bachelors | Race4 == ‘White’ & Sample == ‘XSection’ & !(BA_Atta == ‘InSchl/NoD’ & AllvBA == 0 & is.na(AllvBA)) |
| oolf4wks | Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Emp568) & Sex == ‘Male’ |
| oolf4wks.hs | Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Emp568) & Sex == ‘Male’ & EdSample == ‘HS’ |
| oolf4wks.col | Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Emp568) & Sex == ‘Male’ & EdSample == ‘College’ |
| unemployed4wks | Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Emp568) & Sex == ‘Male’ & !is.na(UnempSam) |
| unemployed4wks.hs | Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Emp568) & Sex == ‘Male’ & !is.na(UnempSam) & EdSample == ‘HS’ |
| unemployed4wks.col | Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Emp568) & Sex == ‘Male’ & !is.na(UnempSam) & EdSample == ‘College’ |
| ever_married30 | Race4 == ‘White’ & Sample == ‘XSection’ & IntAge90 >= 30 & !is.na(EverWed) |
| ever_married30.hs | Race4 == ‘White’ & Sample == ‘XSection’ & IntAge90 >= 30 & !is.na(EverWed) & EdSample == ‘HS’ |
| ever_married30.col | Race4 == ‘White’ & Sample == ‘XSection’ & IntAge90 >= 30 & !is.na(EverWed) & EdSample == ‘College’ |
| divorced_in5yrs | Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Div5Yrs) |
| divorced_in5yrs.hs | Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Div5Yrs) & EdSample == ‘HS’ |
| divorced_in5yrs.col | Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Div5Yrs) & EdSample == ‘College’ |
| divorced_in5yrs_parents | Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Div5Yrs) & Adult14S != ’’ |
| surveyed_in_jail | Race4 == ‘White’ & Sample == ‘XSection’ & Sex == ‘Male’ |
| surveyed_in_jail.hs | Race4 == ‘White’ & Sample == ‘XSection’ & Sex == ‘Male’ & EdSample == ‘HS’ |
| middle_class_values | Race4 == ‘White’ & Sample == ‘XSection’ |
| middle_class_values.hs | Race4 == ‘White’ & Sample == ‘XSection’ & EdSample == ‘HS’ |
| middle_class_values.col | Race4 == ‘White’ & Sample == ‘XSection’ & EdSample == ‘College’ |
Lastly, I list the model specifications used for this project:
kable(tbc.models[, .(label, formula)])
| label | formula |
|---|---|
| poverty | Pov89 ~ zAFQT89 + zSES + zAge |
| poverty.hs | Pov89 ~ zAFQT89 + zSES + zAge |
| dropout | LTHSvHS_Ind ~ zAFQT89 + zSES + zAge |
| dropout_interaction | LTHSvHS_Ind ~ zAFQT89 + zSES + zAge + zAFQT89:zSES |
| get_ged | GEDvHSGr_Ind ~ zAFQT89 + zSES + zAge |
| get_bachelors | AllvBA ~ zAFQT89 + zSES + zAge |
| oolf4wks | MoOLF89 ~ zAFQT89 + zSES + zAge |
| oolf4wks.hs | MoOLF89 ~ zAFQT89 + zSES + zAge |
| oolf4wks.col | MoOLF89 ~ zAFQT89 + zSES + zAge |
| unemployed4wks | MoUnemp8 ~ zAFQT89 + zSES + zAge |
| unemployed4wks.hs | MoUnemp8 ~ zAFQT89 + zSES + zAge |
| unemployed4wks.col | MoUnemp8 ~ zAFQT89 + zSES + zAge |
| ever_married30 | WedBy30_Ind ~ zAFQT89 + zSES + zAge |
| ever_married30.hs | WedBy30_Ind ~ zAFQT89 + zSES + zAge |
| ever_married30.col | WedBy30_Ind ~ zAFQT89 + zSES + zAge |
| divorced_in5yrs | Div5Yrs ~ zAFQT89 + zSES + zAge + MarDate |
| divorced_in5yrs.hs | Div5Yrs ~ zAFQT89 + zSES + zAge + MarDate |
| divorced_in5yrs.col | Div5Yrs ~ zAFQT89 + zSES + zAge + MarDate |
| divorced_in5yrs_parents | Div5Yrs ~ zAFQT89 + zSES + zAge + Adult14S |
| surveyed_in_jail | Jail ~ zAFQT89 + zSES + zAge |
| surveyed_in_jail.hs | Jail ~ zAFQT89 + zSES + zAge |
| middle_class_values | MCV_Inde ~ zAFQT89 + zSES + zAge |
| middle_class_values.hs | MCV_Inde ~ zAFQT89 + zSES + zAge |
| middle_class_values.col | MCV_Inde ~ zAFQT89 + zSES + zAge |
Below I give more detail on how the MCC predictive evaluation is performed:
1. Specify a training dataset and formula consistent with a TBC model.
2. Fit 10,000 logistic regression models via bootstrapping (i.e., resampling the observed joint distribution with equal sizes).
3. Apply each bootstrapped model to the training data.
4. Map the modeled probabilities to classifications by selecting a cutoff value.
5. Build a confusion matrix from the modeled classifications and the actual target data.
6. Calculate various performance metrics for a binary target, in particular, MCC.
7. Repeat steps 4 - 6 for a large range of possible cutoff values (i.e., a linear grid of length 500 between 0 and 1).
8. Inspect the distribution of MCC across all cutoff values to find the maximum MCC value.
9. With the IQ factor on the x-axis, plot the average observed target and bootstrapped classifications for a set of cutoff values (including the one that maximizes MCC). Use the plot to build intuition about the maximum MCC value.
10. Repeat all prior steps for each of the 24 training datasets and formulae.
11. Summarize the maximum MCC values across all reproduced TBC models and by training and holdout data.
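The core of these steps can be sketched as follows (a simplified illustration on simulated data with far fewer iterations; the project's full implementation is TBC_Bootstrap.R):

```r
# Simplified sketch of the evaluation loop on simulated data
# (the coefficients used to simulate Pov89 loosely echo the poverty model)
set.seed(1)
n  <- 1000
dt <- data.frame(zAFQT89 = rnorm(n), zSES = rnorm(n), zAge = rnorm(n))
dt$Pov89 <- rbinom(n, 1, plogis(-2.6 - 0.8 * dt$zAFQT89 - 0.3 * dt$zSES))

cutoffs <- seq(0, 1, length.out = 500)  # step 7: linear grid of cutoffs
n.boot  <- 50                           # the project uses 10,000

# Steps 5-6: confusion-matrix counts -> MCC
mcc.fun <- function(actual, pred) {
  tp <- sum(actual == 1 & pred == 1); tn <- sum(actual == 0 & pred == 0)
  fp <- sum(actual == 0 & pred == 1); fn <- sum(actual == 1 & pred == 0)
  den <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  if (den == 0) 0 else (tp * tn - fp * fn) / den
}

# Steps 2-4: resample with equal size, refit, classify at each cutoff
boot.mcc <- replicate(n.boot, {
  idx <- sample(n, n, replace = TRUE)
  fit <- glm(Pov89 ~ zAFQT89 + zSES + zAge, family = binomial, data = dt[idx, ])
  p   <- predict(fit, newdata = dt, type = 'response')
  sapply(cutoffs, function(k) mcc.fun(dt$Pov89, as.integer(p >= k)))
})

# Step 8: the cutoff that maximizes the average bootstrapped MCC
cutoffs[which.max(rowMeans(boot.mcc))]
```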
These steps are implemented in R using the following scripts within this repository:
000_Misc.R: Attaches needed packages and defines a convenience function to report timing
100_Resampling.R: Custom functions to perform resampling techniques via bootstrapping and the jackknife method
200_Classification.R: Custom functions to create classifications from modeled probabilities and then quantify their performance
300_Visualizations.R: Custom functions to create various visualizations to understand the predictive performance of binary classifications
400_Model_Definitions.R: Parameters used to define and reproduce logistic regression models from The Bell Curve
TBC_Bootstrap.R: Recursive code that executes a batch run for each TBC probability model. This script is the workhorse of the entire project. See its code for details.
Note: When 10,000 bootstrap iterations are used on all 24 models in the prior script, the entire recursive batch process:
In this section, I present two types of results. First, I walk through output from the poverty model. Output from the remaining models is displayed in the appendix. Second, I summarize the optimal MCC values as evaluated on the training and holdout datasets for each model.
Before describing output from the poverty model, I import all the saved plot images and other objects from the batch run (executed by “TBC_Bootstrap.R”).
library(cowplot)
# Note: To import svg images, cowplot requires the magick package but only states so when trying to import them
# Load data
save.folder <- '100 Data'
# Create a simpler vector of model labels
model.labels <- tbc.models$label
# Get filenames of all data saved from the bootstrap analysis
filenames <- list.files(save.folder)
# For bootstrap plots and output files, create named lists via model labels that store the relevant filenames
# E.g.,
# names(filenames.plots)[1] = 'poverty'
# names(filenames.plots)[24] = 'middle_class_values.col'
filenames.plots <- vector(mode="list", length = length(model.labels))
names(filenames.plots) <- model.labels
# For each model label, store relevant filenames for either the bootstrap plots or output files
for(label in model.labels) {
filenames.plots[[label]] <- grep(pattern = paste0('^',label,'_plot'), x = filenames, perl = TRUE, value = TRUE)
}
# For bootstrap plots and output files, create named lists via model labels that stores the relevant R objects
# Within each list index of plots, create named list to denote each type of R object for easier access in the report
# E.g.,
# names(plots)[1] = 'poverty'
# names(plots[[1]])[1] <- plots.a2e
plot.type <- c(
'plots.a2e_zAFQT89'
,'plots.a2e_zAge'
,'plots.a2e_zSES'
,'plots.hist.cutoffs.training_cutoff.2stdev'
,'plots.hist.cutoffs.training_cutoff.roc'
,'plots.hist.cutoffs.training_cutoff.mcc'
,'plots.hist.cutoffs.holdout_cutoff.2stdev'
,'plots.hist.cutoffs.holdout_cutoff.roc'
,'plots.hist.cutoffs.holdout_cutoff.mcc'
,'plots.class.cutoffs.training_cutoff.2stdev'
,'plots.class.cutoffs.training_cutoff.roc'
,'plots.class.cutoffs.training_cutoff.mcc'
,'plots.class.cutoffs.holdout_cutoff.2stdev'
,'plots.class.cutoffs.holdout_cutoff.roc'
,'plots.class.cutoffs.holdout_cutoff.mcc'
,'plots.classFormat.mcc.training_plot.a2e'
,'plots.classFormat.mcc.training_plot.volume'
,'plots.classFormat.training_plot.a2e'
,'plots.classFormat.training_plot.volume'
,'plots.classFormat.mcc.holdout_plot.a2e'
,'plots.classFormat.mcc.holdout_plot.volume'
,'plots.classFormat.holdout_plot.a2e'
,'plots.classFormat.holdout_plot.volume'
,'plot.dist.mcc.training'
,'plot.dist.mcc.holdout'
,'plot.roc.training'
,'plot.roc.holdout'
,'plot.dist.mcc.training.bootCI'
,'plot.dist.mcc.holdout.bootCI'
)
plots <- vector(mode="list", length = length(model.labels))
names(plots) <- model.labels
# Import each saved SVG file and store it as a named element within the plots list
for(label in model.labels) {
plots[[label]] <- vector(mode="list", length = length(plot.type))
names(plots[[label]]) <- plot.type
for(type in plot.type) {
filename.plot.type <- paste0(save.folder,'/',label,'_',type,'.svg')
plots[[label]][[type]] <- cowplot::ggdraw() + cowplot::draw_image(magick::image_read_svg(filename.plot.type, width = 3000))
}
}
# Import the outputs list for the poverty model
output.poverty <- data.table(t(readRDS('100 Data/poverty_outputs.rds')))
Since reproducibility is critical for this project, I display the model parameters associated with the original, non-bootstrapped poverty model. These values should correspond to the ones listed on P. 620 of TBC. (The reference TBC page for each model is in the tbc.models$appendix.page field as described in the “400_Model_Definitions.R” script)
summary(output.poverty[1L, glm.model][[1]])
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2724 -0.4127 -0.3041 -0.2153 3.2655
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64877 0.07688 -34.452 < 2e-16 ***
## zAFQT89 -0.83767 0.09351 -8.958 < 2e-16 ***
## zSES -0.33008 0.09010 -3.663 0.000249 ***
## zAge -0.02384 0.07237 -0.329 0.741863
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1750.7 on 3366 degrees of freedom
## Residual deviance: 1568.8 on 3363 degrees of freedom
## AIC: 1576.8
##
## Number of Fisher Scoring iterations: 6
The 3 plots below show the average of the poverty target and the modeled probabilities (without bootstrapping) across a banded version of the 3 main variables: zAFQT89, zSES, and zAge. Both zAFQT89 and zSES tend to have a monotonically decreasing relationship with the poverty target. The plot for zAFQT89 is the least noisy of the 3, and the plot for zAge is the noisiest. By inspection, these plots show that zAFQT89 and zSES influence the target but do not capture a significant amount of the variance in the data.
Note: I am having plotting issues in this version of the report, as shown by the excessive whitespace around plots from imported SVG files.
plots[['poverty']][['plots.a2e_zAFQT89']]
plots[['poverty']][['plots.a2e_zSES']]
plots[['poverty']][['plots.a2e_zAge']]
The 2 plots below show the bootstrapped distributions of MCC values across a dense grid of cutoff values. The first plot is the MCC distribution when evaluated on the training data, the second is that using the holdout data. The optimal MCC value is displayed above each plot.
The solid orange line represents the original sample MCC values for each cutoff. The shaded region around this line is the bias-corrected and accelerated (BCa) 95% percentile interval for the sample MCC values. The dotted orange line represents the average bootstrapped MCC value by cutoff. The blue hexagonal tiles represent a density plot of bootstrapped MCC values for that local region. Higher density tiles have a lighter shade of blue.
cat(paste0('The optimal MCC value using training data = ',output.poverty[1L, mcc.optimal.training][[1]]))
## The optimal MCC value using training data = 0.225068316915218
cat(paste0('The cutoff value associated with the optimal MCC value from training data = ',output.poverty[1L, cutoffs][[1]]['cutoff.mcc']))
## The cutoff value associated with the optimal MCC value from training data = 0.15
plots[['poverty']][['plot.dist.mcc.training.bootCI']]
cat(paste0('The optimal MCC value using holdout data = ',output.poverty[1L, mcc.optimal.holdout][[1]]))
## The optimal MCC value using holdout data = 0.337074404515184
plots[['poverty']][['plot.dist.mcc.holdout.bootCI']]
The 2 plots below show the averages of the poverty target, modeled probabilities, and the classifications using the cutoff value associated with the optimal MCC value from the training data vs a banded version of zAFQT89. The first plot shows results using the training data, the second is that using the holdout data. Overall, the average classifications poorly reflect the observed values for both datasets, which is consistent with optimal MCC values below 35%.
plots[['poverty']][['plots.classFormat.mcc.training_plot.a2e']]
plots[['poverty']][['plots.classFormat.mcc.holdout_plot.a2e']]
The main output of this project is the following plot, which compares the optimal MCC values between the training and holdout datasets for each model. The x-axis represents the optimal MCC values as evaluated on the training data, the y-axis is that evaluated on the holdout data. To simplify the plot, the models are grouped together by category. With the exception of a few education-related TBC models, all other TBC models have optimal MCC values below 50% for either training or holdout data.
# Import all outputs
library(RColorBrewer)
save.folder <- '100 Data'
model.labels <- tbc.models$label
filenames.output <- paste0(save.folder, '/', model.labels, '_outputs.rds')
outputs <- rbindlist(lapply(seq_along(filenames.output), function(x) data.table(t(readRDS(filenames.output[x])))))
outputs[, label := model.labels]
categories <- c(
  rep('Poverty', 2)
  ,rep('Education', 4)
  ,rep('Employment', 6)
  ,rep('Marriage', 7)
  ,rep('Crime', 2)
  ,rep(shQuote('Middle Class Values'), 3)
)
# Create category field
outputs[, Category := factor(categories, levels = unique(categories))]
# Ensure MCC fields are numeric
outputs[, mcc.optimal.training := as.numeric(mcc.optimal.training)]
outputs[, mcc.optimal.holdout := as.numeric(mcc.optimal.holdout)]
# Plot
ggplot(
outputs[, .(mcc.optimal.training, mcc.optimal.holdout, Category)]
, aes(x = mcc.optimal.training, y = mcc.optimal.holdout, group = Category, color = Category, shape = Category)
) +
geom_point(size = 4) +
scale_color_brewer(palette = "Dark2") +
xlab('Optimal MCC Value (Training Data)') +
ylab('Optimal MCC Value (Holdout Data)') +
lims(x = c(0,1), y = c(0,1)) +
theme_bw() +
theme(
axis.text=element_text(size=16)
, axis.title=element_text(size=16)
, legend.text=element_text(size=16)
, legend.title=element_text(size=16)
)
What does the plot comparing optimal MCC values mean? Using the guidelines for interpreting MCC values listed above, nearly all TBC models show at most a low correlation (i.e., an MCC below 0.50) between their observed binary target and the bootstrapped classification when evaluated on either training or holdout data. This suggests TBC’s probability models are not materially predictive and so they inadequately support the authors’ public policy recommendations.
The only TBC models with optimal MCC values above 0.50 are those with binary targets related to education: receiving a GED, receiving a bachelor’s degree, or dropping out of high school. Again using the above guidelines for interpreting MCC values, the optimal MCC values for these models are “moderate.” While these models are relatively more predictive than the other TBC models, they alone do not warrant the authors’ recommendations.
Note: These really should be generated by a simple function, but I’ve run out of time…
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1980 -0.4354 -0.3543 -0.2759 2.9017
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.7238 0.1290 -21.111 < 2e-16 ***
## zAFQT89 -0.8267 0.1627 -5.080 3.77e-07 ***
## zSES -0.3620 0.1500 -2.413 0.0158 *
## zAge 0.1049 0.1095 0.959 0.3378
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 694.58 on 1235 degrees of freedom
## Residual deviance: 650.54 on 1232 degrees of freedom
## AIC: 658.54
##
## Number of Fisher Scoring iterations: 5
## The optimal MCC value using training data = 0.17343841842523
## The cutoff value associated with the optimal MCC value from training data = 0.08
## The optimal MCC value using holdout data = 0.312399884345027
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.60536 -0.38341 -0.20105 -0.09667 3.06586
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.85323 0.09395 -30.368 < 2e-16 ***
## zAFQT89 -1.72296 0.10281 -16.759 < 2e-16 ***
## zSES -0.64776 0.08966 -7.225 5.03e-13 ***
## zAge 0.05696 0.06883 0.828 0.408
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2347.8 on 3571 degrees of freedom
## Residual deviance: 1560.0 on 3568 degrees of freedom
## AIC: 1568
##
## Number of Fisher Scoring iterations: 6
## The optimal MCC value using training data = 0.494010257784765
## The cutoff value associated with the optimal MCC value from training data = 0.2
## The optimal MCC value using holdout data = 0.601408676867907
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4996 -0.3901 -0.1779 -0.0617 3.2502
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.91432 0.10293 -28.313 < 2e-16 ***
## zAFQT89 -1.89377 0.11884 -15.936 < 2e-16 ***
## zSES -0.94024 0.12505 -7.519 5.53e-14 ***
## zAge 0.05227 0.06827 0.766 0.443944
## zAFQT89:zSES -0.41332 0.11878 -3.480 0.000502 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2347.8 on 3571 degrees of freedom
## Residual deviance: 1547.8 on 3567 degrees of freedom
## AIC: 1557.8
##
## Number of Fisher Scoring iterations: 7
## The optimal MCC value using training data = 0.492245432165761
## The cutoff value associated with the optimal MCC value from training data = 0.21
## The optimal MCC value using holdout data = 0.594994937920605
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2512 -0.4473 -0.3507 -0.2584 2.7730
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.35485 0.06539 -36.014 < 2e-16 ***
## zAFQT89 -0.43253 0.08512 -5.081 3.75e-07 ***
## zSES -0.60822 0.08375 -7.262 3.81e-13 ***
## zAge -0.04164 0.06624 -0.629 0.53
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1974.7 on 3493 degrees of freedom
## Residual deviance: 1830.6 on 3490 degrees of freedom
## AIC: 1838.6
##
## Number of Fisher Scoring iterations: 6
## The optimal MCC value using training data = 0.174449610068079
## The cutoff value associated with the optimal MCC value from training data = 0.06
## The optimal MCC value using holdout data = 0.285684510684511
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6986 -0.5662 -0.2698 0.1912 3.6879
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.41463 0.07783 -31.026 < 2e-16 ***
## zAFQT89 1.77473 0.07825 22.680 < 2e-16 ***
## zSES 1.01514 0.06784 14.963 < 2e-16 ***
## zAge -0.30316 0.05113 -5.929 3.05e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4365.6 on 3856 degrees of freedom
## Residual deviance: 2778.9 on 3853 degrees of freedom
## AIC: 2786.9
##
## Number of Fisher Scoring iterations: 6
## The optimal MCC value using training data = 0.55399479297447
## The cutoff value associated with the optimal MCC value from training data = 0.26
## The optimal MCC value using holdout data = 0.646041223438848
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7781 -0.5032 -0.4407 -0.3839 2.5344
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.20264 0.08680 -25.376 < 2e-16 ***
## zAFQT89 -0.36247 0.09928 -3.651 0.000261 ***
## zSES 0.21788 0.10757 2.025 0.042820 *
## zAge -0.12815 0.08640 -1.483 0.138013
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1115.4 on 1685 degrees of freedom
## Residual deviance: 1096.5 on 1682 degrees of freedom
## AIC: 1104.5
##
## Number of Fisher Scoring iterations: 5
## The optimal MCC value using training data = 0.0924794744495883
## The cutoff value associated with the optimal MCC value from training data = 0.1
## The optimal MCC value using holdout data = 0.144380923743628
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7061 -0.4238 -0.3623 -0.3052 2.6075
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6978 0.1767 -15.264 <2e-16 ***
## zAFQT89 -0.4215 0.2264 -1.862 0.0627 .
## zSES 0.5649 0.2300 2.456 0.0141 *
## zAge -0.1456 0.1673 -0.870 0.3841
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 322.88 on 620 degrees of freedom
## Residual deviance: 313.96 on 617 degrees of freedom
## AIC: 321.96
##
## Number of Fisher Scoring iterations: 5
## The optimal MCC value using training data = 0.0905788678471054
## The cutoff value associated with the optimal MCC value from training data = 0.09
## The optimal MCC value using holdout data = 0.0854278704792859
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0872 -0.3847 -0.2834 -0.1921 2.9217
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1296 0.6082 -5.146 2.66e-07 ***
## zAFQT89 -0.8432 0.4527 -1.863 0.0625 .
## zSES 0.9451 0.3875 2.439 0.0147 *
## zAge -0.4606 0.2990 -1.540 0.1235
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 126.66 on 267 degrees of freedom
## Residual deviance: 113.07 on 264 degrees of freedom
## AIC: 121.07
##
## Number of Fisher Scoring iterations: 6
## The optimal MCC value using training data = 0.251892900905496
## The cutoff value associated with the optimal MCC value from training data = 0.18
## The optimal MCC value using holdout data = 0.0125472709709175
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7332 -0.4182 -0.3531 -0.2979 2.6654
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.53577 0.10760 -23.566 < 2e-16 ***
## zAFQT89 -0.49487 0.12989 -3.810 0.000139 ***
## zSES -0.02535 0.13838 -0.183 0.854661
## zAge -0.02181 0.11084 -0.197 0.843970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 720.05 on 1396 degrees of freedom
## Residual deviance: 697.43 on 1393 degrees of freedom
## AIC: 705.43
##
## Number of Fisher Scoring iterations: 5
## The optimal MCC value using training data = 0.13636341913744
## The cutoff value associated with the optimal MCC value from training data = 0.54
## The optimal MCC value using holdout data = 0.167676842722803
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5698 -0.4232 -0.3734 -0.3369 2.4738
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5988 0.1766 -14.715 <2e-16 ***
## zAFQT89 -0.3935 0.2369 -1.661 0.0966 .
## zSES 0.1395 0.2353 0.593 0.5532
## zAge -0.1051 0.1762 -0.596 0.5509
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 284.71 on 536 degrees of freedom
## Residual deviance: 280.98 on 533 degrees of freedom
## AIC: 288.98
##
## Number of Fisher Scoring iterations: 5
## The optimal MCC value using training data = 0.152252980756548
## The cutoff value associated with the optimal MCC value from training data = 0.55
## The optimal MCC value using holdout data = 0.0934865768327446
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7832 -0.3498 -0.2655 -0.1831 2.9024
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1687 0.7277 -4.355 1.33e-05 ***
## zAFQT89 -0.9197 0.5641 -1.630 0.1031
## zSES 1.0039 0.5016 2.002 0.0453 *
## zAge 0.2942 0.3311 0.889 0.3743
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 88.152 on 227 degrees of freedom
## Residual deviance: 81.012 on 224 degrees of freedom
## AIC: 89.012
##
## Number of Fisher Scoring iterations: 6
## The optimal MCC value using training data = 0.294795327544672
## The cutoff value associated with the optimal MCC value from training data = 0.98
## The optimal MCC value using holdout data = 0.216322010592139
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0635 0.5747 0.6645 0.7173 0.9074
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.19841 0.12890 9.297 <2e-16 ***
## zAFQT89 -0.04736 0.07579 -0.625 0.5320
## zSES -0.19055 0.07863 -2.423 0.0154 *
## zAge 0.20403 0.12905 1.581 0.1139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1692.4 on 1633 degrees of freedom
## Residual deviance: 1679.5 on 1630 degrees of freedom
## AIC: 1687.5
##
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.0615536629052357
## The cutoff value associated with the optimal MCC value from training data = 0.81
## The optimal MCC value using holdout data = 0.0448060684227072
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2489 0.4522 0.5444 0.6280 0.9513
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.4149 0.2343 6.040 1.54e-09 ***
## zAFQT89 0.5142 0.1598 3.217 0.00129 **
## zSES -0.1129 0.1583 -0.713 0.47572
## zAge 0.3683 0.2423 1.520 0.12846
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 532.66 on 604 degrees of freedom
## Residual deviance: 518.81 on 601 degrees of freedom
## AIC: 526.81
##
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.169164163883082
## The cutoff value associated with the optimal MCC value from training data = 0.71
## The optimal MCC value using holdout data = 0.0359356811522763
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6383 -1.5055 0.8298 0.8615 0.9549
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.71372 0.39462 1.809 0.0705 .
## zAFQT89 0.05014 0.22375 0.224 0.8227
## zSES 0.09683 0.18337 0.528 0.5975
## zAge -0.01778 0.29509 -0.060 0.9520
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 291.06 on 236 degrees of freedom
## Residual deviance: 290.71 on 233 degrees of freedom
## AIC: 298.71
##
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.0429999820833445
## The cutoff value associated with the optimal MCC value from training data = 0.96
## The optimal MCC value using holdout data = 0.0659401887074127
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1265 -0.7064 -0.6137 -0.4877 2.2772
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.70861 1.98581 2.875 0.004044 **
## zAFQT89 -0.35734 0.07813 -4.574 4.79e-06 ***
## zSES 0.22195 0.07876 2.818 0.004831 **
## zAge -0.17767 0.07415 -2.396 0.016568 *
## MarDate -0.08677 0.02431 -3.569 0.000358 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2026.5 on 2030 degrees of freedom
## Residual deviance: 1982.7 on 2026 degrees of freedom
## AIC: 1992.7
##
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.116036786658606
## The cutoff value associated with the optimal MCC value from training data = 0.19
## The optimal MCC value using holdout data = 0.13753196318916
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8893 -0.6880 -0.6325 -0.5499 2.1149
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.44514 3.12869 1.740 0.0818 .
## zAFQT89 -0.03792 0.13481 -0.281 0.7785
## zSES 0.22069 0.12882 1.713 0.0867 .
## zAge -0.10781 0.11468 -0.940 0.3472
## MarDate -0.08400 0.03832 -2.192 0.0284 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 864.50 on 874 degrees of freedom
## Residual deviance: 857.41 on 870 degrees of freedom
## AIC: 867.41
##
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.0539839906990826
## The cutoff value associated with the optimal MCC value from training data = 0.19
## The optimal MCC value using holdout data = 0.0704713471092265
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0638 -0.4025 -0.3032 -0.2108 2.8276
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 32.39423 13.50911 2.398 0.0165 *
## zAFQT89 -0.75623 0.45023 -1.680 0.0930 .
## zSES -0.07353 0.35889 -0.205 0.8377
## zAge -0.55879 0.40470 -1.381 0.1674
## MarDate -0.41115 0.16298 -2.523 0.0116 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 107.925 on 208 degrees of freedom
## Residual deviance: 96.829 on 204 degrees of freedom
## AIC: 106.83
##
## Number of Fisher Scoring iterations: 6
## The optimal MCC value using training data = 0.203076829254132
## The cutoff value associated with the optimal MCC value from training data = 0.07
## The optimal MCC value using holdout data = 0.339836917410076
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0071 -0.6985 -0.6214 -0.5131 2.2298
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.12299 0.19200 -5.849 4.95e-09 ***
## zAFQT89 -0.39256 0.07742 -5.070 3.97e-07 ***
## zSES 0.19104 0.07833 2.439 0.0147 *
## zAge -0.02781 0.06177 -0.450 0.6526
## Adult14S2 Bio -0.31954 0.20437 -1.564 0.1179
## Adult14SBio/Step -0.11008 0.25682 -0.429 0.6682
## Adult14SOther -0.25948 0.37893 -0.685 0.4935
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2022.8 on 2028 degrees of freedom
## Residual deviance: 1989.4 on 2022 degrees of freedom
## AIC: 2003.4
##
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.0940162389447624
## The cutoff value associated with the optimal MCC value from training data = 0.19
## The optimal MCC value using holdout data = 0.111113301248484
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8179 -0.2506 -0.1819 -0.1337 3.4623
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.7772 0.1718 -21.986 < 2e-16 ***
## zAFQT89 -0.8967 0.1754 -5.113 3.17e-07 ***
## zSES -0.1555 0.1806 -0.861 0.389
## zAge 0.0783 0.1469 0.533 0.594
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 486.43 on 1944 degrees of freedom
## Residual deviance: 439.80 on 1941 degrees of freedom
## AIC: 447.8
##
## Number of Fisher Scoring iterations: 7
## The optimal MCC value using training data = 0.150823426510628
## The cutoff value associated with the optimal MCC value from training data = 0.02
## The optimal MCC value using holdout data = 0.213056160247416
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6631 -0.1559 -0.1172 -0.0893 3.2606
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.9658 0.4806 -10.332 <2e-16 ***
## zAFQT89 -1.0701 0.4431 -2.415 0.0157 *
## zSES -0.1621 0.4643 -0.349 0.7270
## zAge 0.4673 0.3678 1.271 0.2039
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 87.818 on 715 degrees of freedom
## Residual deviance: 79.701 on 712 degrees of freedom
## AIC: 87.701
##
## Number of Fisher Scoring iterations: 8
## The optimal MCC value using training data = 0.255071154213153
## The cutoff value associated with the optimal MCC value from training data = 0.99
## The optimal MCC value using holdout data = 0.0374968148960129
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0494 -1.0869 0.5879 1.0762 2.1285
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.063853 0.038934 -1.640 0.101
## zAFQT89 0.632507 0.052818 11.975 < 2e-16 ***
## zSES 0.244956 0.052062 4.705 2.54e-06 ***
## zAge 0.006637 0.040193 0.165 0.869
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4198.3 on 3028 degrees of freedom
## Residual deviance: 3874.9 on 3025 degrees of freedom
## AIC: 3882.9
##
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.27127871233613
## The cutoff value associated with the optimal MCC value from training data = 0.43
## The optimal MCC value using holdout data = 0.450510368706667
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6090 -1.3136 0.9618 1.0268 1.2166
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.39448 0.06118 6.448 1.14e-10 ***
## zAFQT89 0.16815 0.09312 1.806 0.0710 .
## zSES -0.17993 0.09034 -1.992 0.0464 *
## zAge 0.01888 0.06218 0.304 0.7614
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1569.7 on 1161 degrees of freedom
## Residual deviance: 1563.7 on 1158 degrees of freedom
## AIC: 1571.7
##
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.0588466467419308
## The cutoff value associated with the optimal MCC value from training data = 0.57
## The optimal MCC value using holdout data = 0.0577337608415983
##
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"),
## data = DT.model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0979 0.5413 0.6295 0.7029 1.0368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.99516 0.23868 4.169 3.05e-05 ***
## zAFQT89 0.39251 0.19881 1.974 0.0483 *
## zSES 0.03692 0.16858 0.219 0.8266
## zAge 0.13876 0.13364 1.038 0.2991
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 406.72 on 401 degrees of freedom
## Residual deviance: 400.18 on 398 degrees of freedom
## AIC: 408.18
##
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.130393984723014
## The cutoff value associated with the optimal MCC value from training data = 0.73
## The optimal MCC value using holdout data = 0.0679175604648386
Herrnstein, R. J., & Murray, C. (1994). The Bell Curve. New York: The Free Press.↩
Bureau of Labor Statistics, U.S. Department of Labor. (2014). National Longitudinal Survey of Youth 1979 cohort, 1979–2012 (rounds 1–25). Produced and distributed by the Center for Human Resource Research, The Ohio State University. Columbus, OH.↩
Center for Human Resource Research, The Ohio State University, prepared for the U.S. Department of Labor. (2001). NLSY79 Users’ Guide: A Guide to the 1979–2000 National Longitudinal Survey of Youth Data. Columbus, OH: Center for Human Resource Research, Ohio State University.↩
Heckman, J. (1995). Lessons from The Bell Curve. Journal of Political Economy, 103(5), 1091–1120.↩
Krenz, C. (2007, August 8). Anatomy of an Analysis. Retrieved from http://www.claudiax.net/bell.html.↩
Shmueli, G. (2010). To Explain or To Predict? Statistical Science, 25(3), 289–310.↩
Miele, F. (1995). For Whom The Bell Curve Tolls. Skeptic, 3(3), 34–41.↩
Boughorbel, S., Jarray, F., & El-Anbari, M. (2017). Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE, 12(6).↩
Mukaka, M. M. (2012). Statistics corner: A guide to appropriate use of correlation coefficient in medical research. Malawi Medical Journal, 24(3), 69–71.↩
Herrnstein and Murray (1994). Nation.txt, Nation.hdr, and 2TBC_Documentation.ascii. Retrieved from http://www.rasmusen.org/xpacioli/bellcurve.↩